{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1> 1. Exploring natality dataset </h1>\n",
    "\n",
    "This notebook illustrates:\n",
    "<ol>\n",
    "<li> Exploring a BigQuery dataset using Datalab\n",
    "</ol>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# change these to try this notebook out\n",
    "BUCKET = 'cloud-training-demos-ml'   # CHANGE this to a globally unique value. Your project name is a good option to try.\n",
    "PROJECT = 'cloud-training-demos'     # CHANGE this to your project name\n",
    "REGION = 'us-central1'               # CHANGE this to one of the regions supported by Cloud AI Platform https://cloud.google.com/ml-engine/docs/tensorflow/regions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['BUCKET'] = BUCKET\n",
    "os.environ['PROJECT'] = PROJECT\n",
    "os.environ['REGION'] = REGION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "if ! gcloud storage ls | grep -q gs://${BUCKET}/; then\n",
    "  gcloud storage buckets create --location=${REGION} gs://${BUCKET}\n",
    "fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Explore data </h2>\n",
    "\n",
    "The data is natality data (record of births in the US). My goal is to predict the baby's weight given a number of factors about the pregnancy and the baby's mother.  Later, we will want to split the data into training and eval datasets. The hash of the year-month will be used for that -- this way, twins born on the same day won't end up in different cuts of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create SQL query using natality data after the year 2000\n",
    "query = \"\"\"\n",
    "SELECT\n",
    "  weight_pounds,\n",
    "  is_male,\n",
    "  mother_age,\n",
    "  plurality,\n",
    "  gestation_weeks,\n",
    "  FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
    "FROM\n",
    "  publicdata.samples.natality\n",
    "WHERE year > 2000\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Call BigQuery and examine in dataframe\n",
    "from google.cloud import bigquery\n",
    "df = bigquery.Client().query(query + \" LIMIT 100\").to_dataframe()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lab Task #1\n",
    "Using the above code as an example, write a query to find the unique values for each of the columns and the count of those values for babies born after the year 2000.\n",
    "For example, we want to get these values:\n",
    "<pre>\n",
    "is_male\tnum_babies\tavg_wt\n",
    "False\t16245054\t7.104715\n",
    " True\t17026860\t7.349797\n",
    "</pre>\n",
    "This is important to ensure that we have enough examples of each data value, and to verify our hunch that the parameter has predictive value.\n",
    "\n",
    "Hint (highlight to see): <p style='color:white'>Use COUNT(), AVG() and GROUP BY. For example:\n",
    "<pre style='color:white'>\n",
    "SELECT\n",
    "  is_male,\n",
    "  COUNT(1) AS num_babies,\n",
    "  AVG(weight_pounds) AS avg_wt\n",
    "FROM\n",
    "  publicdata.samples.natality\n",
    "WHERE\n",
    "  year > 2000\n",
    "GROUP BY\n",
    "  is_male\n",
    "</pre>\n",
    "</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lab Task #2\n",
    "\n",
    "Which factors seem to play a part in the baby's weight?\n",
    "\n",
    "<b>Bonus:</b> Draw graphs to illustrate your conclusions\n",
    "\n",
    "Hint (highlight to see): <p style='color:white'> The simplest way to plot is to use Pandas' built-in plotting capability\n",
    "<pre style='color:white'>\n",
    "from google.cloud import bigquery\n",
    "df = bigquery.Client().query(query).to_dataframe()\n",
    "df.plot(x='is_male', y='num_babies', logy=True, kind='bar');\n",
    "df.plot(x='is_male', y='avg_wt', kind='bar');\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2017-2018 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
